I. "SEND THREE- AND FOUR-PENCE, WE'RE 
GOING TO A DANCE" 

This phrase was heard, it is claimed, over the radio 
during WWI instead of the transmitted tactical phrase 
"Send reinforcements we're going to advance" [1]. As 
illustrative as it is apocryphal, this garbled yet compre- 
hensible transmission sets the tone for our investigations 
here. Namely, what happens to knowledge when it is 
communicated sequentially along a chain, from one indi- 
vidual to the next? What fidelity can one expect? How 
is information lost? How do innovations occur? 

To answer these questions we introduce a theory of se- 
quential causal inference in which learners in a commu- 
nication chain estimate a structural model from their up- 
stream "teacher" and then, using that model, pass along 
samples to their downstream "student". This reminds 
one of the familiar children's game Telephone. By way of 
quickly motivating our sequential learning problem, let's 
briefly recall how the game works. 

To begin, one player invents a phrase and whispers 
it to another player. This player, believing they have 
understood the phrase, then repeats it to a third and 
so on until the last player is reached. The last player 
announces the phrase, winning the game if it matches 
the original. Typically it does not, and that's the fun. 
Amusement and interest in the game derive directly from 
how the initial phrase evolves in odd and surprising ways. 
The further down the chain, the higher the chance that 
errors will make recovery impossible and the less likely 
the original phrase will survive. 

The game is often used in education to teach the lesson 
that human communication is fraught with error. The final 
phrase, though, is not merely accreted error but the 
product of a series of attempts to parse, make sense of, and 
intelligibly communicate the phrase. The phrase's evolution 
is a trade-off between comprehensibility and accumulated 
distortion, as well as the source of the game's 
entertainment. We employ a much more tractable setting 
to make analytical progress on sequential learning, based 
on computational mechanics [2-4], intentionally selecting 
a simpler language system and learning paradigm than 
likely operates with children. 

Specifically, we develop our theory of sequential learn- 
ing as an extension of the evolutionary population dy- 
namics of genetic drift, recasting Kimura's selectively 
neutral theory [5] as a special case of a generalized drift 
process of structured populations with memory. This 
is a substantial departure from the unordered popula- 
tions used in evolutionary biology. Notably, this requires 
a new and more general information-theoretic notion of 
fixation. We examine the diffusion and fixation prop- 
erties of several drift processes, demonstrating that the 
space of drift processes is highly organized. This organi- 
zation controls fidelity, facilitates innovations, and leads 
to information loss in sequential learning and evolution- 
ary processes with and without memory. We close by 
describing applications to learning, inference, and evolu- 
tion, commenting on related efforts. 

To get started, we briefly review genetic drift and fix- 
ation. This will seem like a distraction, but it is a nec- 
essary one since available mathematical results are key. 



Then we introduce in detail our structured variants of 
these concepts — defining the generalized drift process and 
formulating a generalized definition of fixation appropri- 
ate to it. With the background laid out, we begin to 
examine the complexity of structural drift behavior. We 
demonstrate that it is a diffusion process within a space 
that decomposes into a connected network of structured 
subspaces. Building on this decomposition, we explain 
how and when processes jump between these subspaces — 
innovating new structural information or forgetting it — 
thereby controlling the long-time fidelity of the commu- 
nication chain. We then close by outlining future re- 
search and listing several potential applications for struc- 
tural drift, drawing out consequences for evolutionary 
processes that learn. 

Those familiar with neutral evolution theory are urged 
to skip to Sec. V, after skimming the next sections to 
pick up our notation and extensions. 



II. FROM GENETIC TO STRUCTURAL DRIFT 

Genetic drift refers to the change over time in geno- 
type frequencies in a population due to random sampling. 
It is a central and well studied phenomenon in popula- 
tion dynamics, genetics, and evolution. A population of 
genotypes evolves randomly due to drift, but typically 
changes are neither manifested as new phenotypes nor 
detected by selection — they are selectively neutral. Drift 
plays an important role in the spontaneous emergence 
of mutational robustness [6, 7], modern techniques for 
calibrating molecular evolutionary clocks [8], and non- 
adaptive (neutral) evolution [9, 10], to mention only a 
few examples. 

Selectively neutral drift is typically modeled as a 
stochastic process: A random walk that tracks finite pop- 
ulations of individuals in terms of their possessing (or 
not) a variant of a gene. In the simplest models, the 
random walk occurs in a space that is a function of geno- 
types in the population. For example, a drift process can 
be considered to be a random walk of the fraction of indi- 
viduals with a given variant. In the simplest cases there, 
the model reduces to the dynamics of repeated binomial 
sampling of a biased coin, in which the empirical estimate 
of bias becomes the bias in the next round of sampling. 
In the sense we will use the term, the sampling process 
is memoryless. The biased coin, as the population being 
sampled, has no memory: The past is independent of the 
future. The current state of the drift process is simply 
the bias, a number between zero and one that summa- 
rizes the state of the population. 

The theory of genetic drift predicts a number of mea- 
surable properties. For example, one can calculate the 
expected time until all or no members of a population 
possess a particular gene variant. These final states are 
referred to as fixation and deletion, respectively. Vari- 
ation due to sampling vanishes once these states are 
reached and, for all practical purposes, drift stops. From 



then on, the population is homogeneous; further sam- 
pling can introduce no genotypic variation. These states 
are fixed points — in fact, absorbing states — of the drift 
stochastic process. 

The analytical predictions for the time to fixation and 
time to deletion were developed by Kimura and Ohta 
[5, 11] in the 1960s and are based on the memoryless mod- 
els and simplifying assumptions introduced by Wright 
[12] and Fisher [13] in the early 1930s. The theory has 
advanced substantially since then to handle more realistic 
models and to predict additional effects due to selection 
and mutation. These range from multi-allele drift mod- 
els and F-statistics [14] to pseudohitchhiking models of 
"genetic draft" [15]. 

The following explores what happens when we relax 
the memoryless restriction. The original random walk 
model of genetic drift forces the statistical structure at 
each sampling step to be an independent, identically dis- 
tributed (IID) stochastic process. This precludes any 
memory in the sampling. Here, we extend the IID the- 
ory to use time-varying probabilistic state machines to 
describe memoryful population sampling. 

In the larger setting of sequential learning, we will show 
that memoryful sequential sampling exhibits structurally 
complex, drift-like behavior. We call the resulting phe- 
nomenon structural drift. Our extension presents a num- 
ber of new questions regarding the organization of the 
space of drift processes and how they balance structure 
and randomness. To examine these questions, we require 
a more precise description of the original drift theory. 



III. GENETIC DRIFT 

We begin with the definition of an allele, which is one of 
several alternate forms of a gene. The textbook example 
is given by Mendel's early experiments on heredity [16], 
in which he observed that the flowers of a pea plant were 
colored either white or violet, this being determined by 
the combination of alleles inherited from its parents. A 
new, mutant allele is introduced into a population by the 
mutation of a wild-type allele. A mutant allele can be 
passed on to an individual's offspring who, in turn, may 
pass it on to their offspring. Each inheritance occurs with 
some probability. 

Genetic drift, then, is the change of allele frequencies 
in a population over time. It is the process by which 
the number of individuals with an allele varies generation 
after generation. The Fisher-Wright theory [12, 13] 
models drift as a stochastic evolutionary process with nei- 
ther selection nor mutation. It assumes random mating 
between individuals and that the population is held at 
a finite, constant size. Moreover, successive populations 
do not overlap in time. 

Under these assumptions the Fisher-Wright theory 
reduces drift to a binomial or multinomial sampling 
process — a more complicated version of familiar random 
walks such as Gambler's Ruin or Prisoner's Escape [17]. 



Offspring receive either the wild-type allele A_1 or the 
mutant allele A_2 of a particular gene A from a random 
parent in the previous generation with replacement. A 
population of N diploid^1 individuals will have 2N total 
copies of these alleles. Given i initial copies of A_2 in the 
population, an individual has either A_2 with probability 
i/2N or A_1 with probability 1 - i/2N. The probability that 
j copies of A_2 exist in the offspring's generation given i 
copies in the parent's generation is: 



    P_{ij} = \binom{2N}{j} \left( \frac{i}{2N} \right)^{j} \left( 1 - \frac{i}{2N} \right)^{2N-j} .    (1) 



This specifies the transition dynamic of the drift 
stochastic process over the discrete state space 

    \{ 0, 1/2N, \ldots, (2N-1)/2N, 1 \} . 

This model of genetic drift is a discrete-time random 
walk, driven by samples of a biased coin, over the space 
of biases. The population is a set of coin flips, where 
the probability of Heads or Tails is determined by the 
coin's current bias. After each generation of flips, the 
coin's bias is updated to reflect the number of Heads or 
Tails realized in the new generation. The walk's absorb- 
ing states — all Heads or all Tails — capture the notion 
of fixation and deletion. 
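
As an illustrative sketch of the binomial transition dynamic of Eq. (1), the following Python computes one row of the transition matrix. The function names and the argument `two_n` (standing for 2N) are our own choices for illustration and do not come from the source.

```python
from math import comb


def transition_prob(i: int, j: int, two_n: int) -> float:
    """Pr(j copies of A_2 in the offspring | i copies in the parents), Eq. (1)."""
    q = i / two_n  # current frequency of A_2
    return comb(two_n, j) * q**j * (1.0 - q) ** (two_n - j)


def transition_row(i: int, two_n: int) -> list[float]:
    """Distribution over offspring counts j = 0, ..., 2N given i parental copies."""
    return [transition_prob(i, j, two_n) for j in range(two_n + 1)]


if __name__ == "__main__":
    two_n = 20  # N = 10 diploid individuals
    row = transition_row(5, two_n)
    print(sum(row))                          # each row is a probability distribution
    print(transition_row(0, two_n)[0])       # i = 0: deletion is absorbing
    print(transition_row(two_n, two_n)[-1])  # i = 2N: fixation is absorbing
```

The rows for i = 0 and i = 2N place all probability on j = 0 and j = 2N, which is the absorbing-state property described above.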



IV. GENETIC FIXATION 

Fixation occurs with respect to an allele when all individuals 
in the population carry that specific allele and 
none of its variants. Restated, a mutant allele A_2 reaches 
fixation when all 2N alleles in the population are copies 
of A_2 and, consequently, A_1 has been deleted from the 
population. This halts the random fluctuations in the 
frequency of A_2, assuming A_1 is not reintroduced. 

Let X be a binomially distributed random variable 
with bias probability p that represents the number of 
copies of A_2 in the population. The expected number 
of copies of A_2 is E[X] = 2Np. That is, the expected 
number of copies of A_2 remains constant over time and 
depends only on its initial probability p and the total 
number (2N) of alleles in the population. However, A_2 
eventually reaches fixation or deletion due to the change 
in allele frequency introduced by random sampling and 
the presence of absorbing states. Prior to fixation, the 
mean and variance of the change in allele frequency Δp 
are: 



    E[\Delta p] = 0    and    (2) 

    \mathrm{Var}[\Delta p] = \frac{p(1-p)}{2N} ,    (3) 



^1 Though we first use diploid populations (two alleles per individual 
and thus a sample length of 2N) for direct comparison to 
previous work, we later transition to haploid (single allele per 
individual) populations for notational simplicity. 



respectively. 
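
Both moments follow from binomial sampling. As a short worked check, assume X counts the copies of the mutant allele among the 2N sampled alleles, so that the next generation's frequency is X/2N and the change in frequency is X/2N - p:

```latex
\begin{align*}
E[\Delta p] &= E\!\left[\frac{X}{2N}\right] - p = \frac{2Np}{2N} - p = 0 , \\
\mathrm{Var}[\Delta p] &= \frac{\mathrm{Var}[X]}{(2N)^2}
  = \frac{2N\,p(1-p)}{(2N)^2} = \frac{p(1-p)}{2N} .
\end{align*}
```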

On average there is no change in frequency. However, 
sampling variance causes the process to drift towards the 
absorbing states at p = 0 and p = 1. The drift rate is 
determined by the current generation's allele frequency 
and the total number of alleles. For the neutrally selective 
case, the average number of generations until fixation 
(t_1) or deletion (t_0) is given by Kimura and Ohta [5]: 

    t_1(p) = -\frac{1}{p} \left[ 4 N_e (1-p) \log(1-p) \right]    and    (4) 

    t_0(p) = -4 N_e \left( \frac{p}{1-p} \right) \log p ,    (5) 



where N_e denotes effective population size. For simplicity 
we take N_e = N, meaning all individuals in the population 
are candidates for reproduction. As p → 0, the 
boundary condition is given by: 

    t_1(0) = 4 N_e .    (6) 

That is, excluding cases of deletion, an initially rare mutant 
allele spreads to the entire population in an average of 
4N_e generations. 
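
The two expected times can be evaluated directly. The sketch below assumes that "log" in Eqs. (4) and (5) is the natural logarithm, which is what the boundary value 4N_e requires; the function names are ours.

```python
from math import log


def time_to_fixation(p: float, n_e: float) -> float:
    """Eq. (4): mean generations until fixation, for 0 < p < 1."""
    return -(1.0 / p) * 4.0 * n_e * (1.0 - p) * log(1.0 - p)


def time_to_deletion(p: float, n_e: float) -> float:
    """Eq. (5): mean generations until deletion, for 0 < p < 1."""
    return -4.0 * n_e * (p / (1.0 - p)) * log(p)


if __name__ == "__main__":
    n_e = 10
    for p in (0.1, 0.5, 0.9):
        print(p, time_to_fixation(p, n_e), time_to_deletion(p, n_e))
    # For very small p the fixation time approaches the boundary value 4 * n_e.
    print(time_to_fixation(1e-6, n_e))
```

Note the symmetry t_0(p) = t_1(1 - p): deleting an allele at frequency p is the same event as fixing its alternative at frequency 1 - p.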

One important consequence of the theory is that when 
fixation (p = 1) or deletion (p = 0) is reached, variation 
in the population vanishes: Var[Δp] = 0. With no variation 
there is a homogeneous population, and sampling 
from this population produces the same homogeneous 
population. In other words, this establishes fixation and 
deletion as absorbing states of the stochastic sampling 
process. Once there, drift stops. 

Figure 1 illustrates this, showing both the simulated 
and theoretically predicted number of generations until 
fixation occurs for N = 10, as well as the predicted time 
to deletion for reference. Each simulation was performed 
for a different initial value of p and averaged over 400 
realizations. Using the same methodology as Kimura and 
Ohta [5], we include only those realizations whose mutant 
allele reaches fixation. 

Populations are produced by repeated binomial sampling 
of 2N uniform random numbers between 0 and 1. 
An initial probability 1 - p is assigned to allele A_1 and 
probability p to allele A_2. The count i of A_2 in the initial 
population is incremented for each random number less 
than p. This represents an individual acquiring the allele 
A_2 instead of A_1. The maximum likelihood estimate of 
allele frequency in the initial sample is simply the number 
of A_2 alleles over the sample length: p ≈ i/2N. This 
estimate of p is then used to generate a new population 
of offspring, after which we re-estimate the value of p. 
These steps are repeated each generation until fixation 
at p = 1 or deletion at p = 0 occurs. This is the Monte 
Carlo (MC) sampling method. 
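
The sampling loop just described can be sketched in a few lines of Python. This is a minimal illustration under our own naming (`drift_until_absorbed`, `mean_fixation_time`, `two_n` for 2N), not the simulation code behind Fig. 1.

```python
import random


def drift_until_absorbed(p: float, two_n: int, rng: random.Random) -> tuple[float, int]:
    """Resample 2N alleles each generation until p reaches 0 (deletion) or 1 (fixation)."""
    generations = 0
    while 0.0 < p < 1.0:
        # Each uniform random number below p is an individual acquiring the mutant allele.
        count = sum(1 for _ in range(two_n) if rng.random() < p)
        p = count / two_n  # maximum likelihood estimate used for the next generation
        generations += 1
    return p, generations


def mean_fixation_time(p0: float, two_n: int, realizations: int, seed: int = 0) -> float:
    """Average generations to absorption, keeping only realizations that end in fixation."""
    rng = random.Random(seed)
    times = []
    while len(times) < realizations:
        final_p, generations = drift_until_absorbed(p0, two_n, rng)
        if final_p == 1.0:
            times.append(generations)
    return sum(times) / len(times)


if __name__ == "__main__":
    for p0 in (0.2, 0.5, 0.8):
        print(p0, mean_fixation_time(p0, two_n=20, realizations=400))
```

Discarding the realizations that end in deletion mirrors the conditioning used for the fixation curve; keeping them instead would estimate the unconditional absorption time.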

Kimura's theory and simulations predict the time to 
fixation or deletion of a mutant allele in a finite population 
by the process of genetic drift. The Fisher-Wright 
model and Kimura's theory assume a memoryless population 
in which each offspring inherits allele A_1 or A_2 via 
an IID binomial sampling process. We now generalize 
this to memoryful stochastic processes, giving a new definition 
of fixation and exploring examples of structural 
drift behavior. 

[Figure 1 plot. Horizontal axis: Initial Probability p. Legend: Fixation (MC), Fixation (t_1 Theory), Deletion (t_0 Theory).] 

FIG. 1. Time to fixation for a population of N = 10 individuals 
(sample size 2N = 20) plotted as a function of initial allele 
probability p under the Monte Carlo (MC) sampling regime 
and as given by theoretical prediction (solid line) of Eq. (4). 
Time to deletion is also shown (dashed line), Eq. (5). 

[Figure 2 plot. Horizontal axis: Initial Probability. Legend: Stasis (MC), Stasis (SD), Fixation (t_1 Theory), Deletion (t_0 Theory).] 

FIG. 2. Time to stasis as a function of initial Pr[HEADS] 
for structural drift (SD) of the Biased Coin Process versus 
Monte Carlo (MC) simulation of Kimura's model. Kimura's 
predicted times to fixation and deletion are shown for reference. 
Each estimated time is averaged over 100 realizations 
with sample size N = 1000. 



V. SEQUENTIAL LEARNING 

How can genetic drift be a memoryful stochastic pro- 
cess? Consider a population of N haploid organisms. 
Each generation consists of N alleles and so is repre- 
sented by a string of N symbols, e.g. A1A2 . . . AiAi, 
where each symbol corresponds to an individual with a 
particular allele. In the original drift models, a genera- 
tion of offspring is produced by a memoryless binomial 
sampling process, selecting an offspring's allele from a 
parent with replacement. In contrast, the structural drift 
model produces a generation of individuals in which the 
sample order is tracked. The population is now a string of 
alleles, giving the potential for memory and structure in 
sampling — spatial, temporal, or other interdependencies 
between individuals within a sample. 

At first, this appears as a major difference from the 
usual setting employed in evolutionary biology, where 
populations are treated as unordered collections of in- 
dividuals and sampling is modeled as an independent, 
identically distributed stochastic process. That said, the 
structure we have in mind has several biological interpre- 
tations, such as inbreeding and subdivision [18] or the 
life histories of heterogeneous populations [19]. We later 



return to these alternative interpretations when consid- 
ering applications. 

The model class we select to describe memoryful sampling 
is the ε-machine: the unique, minimal, and optimal 
representation of a stochastic process [4]. As we will 
show, these properties give an important advantage when 
analyzing structural drift, since they allow one to monitor 
the amount of structure innovated or lost during drift. 
We next give a brief overview of ε-machines and refer the 
reader to the previous reference for details. 

ε-Machine representations of the finite-memory discrete-valued 
stochastic processes we consider here form a class 
of (deterministic) probabilistic finite-state machine or 
unifilar hidden Markov model. An ε-machine consists 
of a set of causal states S = {0, 1, ..., k - 1} and a set of 
per-symbol transition matrices: 

    \{ T^{(a)}_{ij} : a \in A \} ,    (7) 

where A = {A_1, ..., A_m} is the set of alleles and where 
the transition probability T^{(a)}_{ij} gives the probability of 
transitioning from causal state S_i to causal state S_j and 
emitting allele a. The causal state probability Pr(σ), 
σ ∈ S, is determined as the left eigenvector (with eigenvalue 
one, normalized to sum to one) of the state-to-state 
transition matrix T = \sum_{a \in A} T^{(a)}. 
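
To make the left-eigenvector statement concrete, here is a small sketch that sums per-symbol matrices into the state-to-state matrix and finds its stationary distribution. It assumes an irreducible chain, uses a damped power iteration of our own choosing (the source does not specify a method), and takes the Golden Mean Process of Fig. 4 with p = 0.5 as its example.

```python
def state_transition_matrix(per_symbol):
    """T = sum over alleles a of T^(a); per_symbol maps allele -> square matrix."""
    mats = list(per_symbol.values())
    k = len(mats[0])
    return [[sum(m[i][j] for m in mats) for j in range(k)] for i in range(k)]


def stationary_distribution(T, iterations=10_000):
    """Left eigenvector of T with eigenvalue one, by damped power iteration."""
    k = len(T)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        step = [sum(pi[i] * T[i][j] for i in range(k)) for j in range(k)]
        # Averaging with the previous iterate avoids oscillation on periodic chains.
        pi = [0.5 * (a + b) for a, b in zip(pi, step)]
    return pi


# Golden Mean Process, p = 0.5, states ordered (A, B):
# A emits 0 and moves to B, or emits 1 and stays in A; B emits 1 and moves to A.
golden_mean = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}

if __name__ == "__main__":
    T = state_transition_matrix(golden_mean)
    print(T)
    print(stationary_distribution(T))
```

For this example the balance equations give Pr(A) = 1/(1 + p) and Pr(B) = p/(1 + p), which the iteration reproduces.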

Maintaining our connection to (haploid) population 
dynamics, we think of an ε-machine as 
a generator of populations or length-N strings: 
α^N = a_1 a_2 ... a_i ... a_N, a_i ∈ A. As a model of a sampling 
process, an ε-machine gives the most compact representation 
of the distribution of strings produced by sampling. 




FIG. 3. ε-Machine for the Alternating Process consisting of 
two causal states S = {A, B} and two transitions. Each transition 
is labeled p | a to indicate the probability p = T^{(a)}_{ij} of 
taking that transition and emitting allele a ∈ A. State A 
emits allele 0 with probability one and transitions to state B, 
while B emits allele 1 with probability one and transitions to 
A. 




FIG. 4. The ε-machine for the Golden Mean Process that 
generates a population with no consecutive 0s. In state A the 
probabilities of generating a 0 or 1 are p and 1 - p, respectively. 



repetition of this step creates a sequential communica- 
tion chain. Sequential learning is thus closely related to 
genetic drift except that sample order is tracked, and this 
order is used in estimating the next model. 

The procedure is analogous to flipping a biased coin 
a number of times, estimating the bias from the results, 
and re-flipping the newly biased coin. Eventually, the 
coin will be completely biased towards Heads or Tails. 
In our drift model the coin is replaced by an ε-machine, 
which removes the IID model constraint and allows for 
the sampling process to take on structure and memory. 
Not only do the transition probabilities T^{(a)}_{ij} change, but 
the structure of the model itself (the number of states 
and the presence or absence of transitions) drifts over 
time to capture the statistics of the sample using as little 
information as possible. This is an essential and distinctive 
aspect of structural drift. 

Before we can explore this dynamic, we first need to 
examine how an ε-machine reaches fixation or deletion. 



Consider a simple binary process that alternately generates 
0s and 1s called the Alternating Process shown in 
Fig. 3. Its ε-machine generates either the string 0101... 
or 1010... depending on the start state. The per-symbol 
transition matrices are: 

    T^{(0)} = \begin{pmatrix} 0.0 & 1.0 \\ 0.0 & 0.0 \end{pmatrix}    and    (8) 

    T^{(1)} = \begin{pmatrix} 0.0 & 0.0 \\ 1.0 & 0.0 \end{pmatrix} .    (9) 

Enforcing the alternating period-2 pattern requires two 
states, A and B, as well as two positive probability transitions 
T^{(0)}_{AB} = 1.0 and T^{(1)}_{BA} = 1.0. 
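
As a sketch of how such per-symbol matrices act as a population generator, the code below walks the Alternating Process from a chosen start state. The dictionary layout and the `generate` helper are illustrative assumptions of ours; states are indexed 0 for A and 1 for B.

```python
import random

# Per-symbol matrices of the Alternating Process; rows and columns are ordered (A, B).
T = {
    "0": [[0.0, 1.0], [0.0, 0.0]],
    "1": [[0.0, 0.0], [1.0, 0.0]],
}


def generate(per_symbol, start, n, rng):
    """Walk the machine for n steps from state index `start`, emitting one allele per step."""
    state, out = start, []
    for _ in range(n):
        # Every (probability, allele, next state) option leaving the current state.
        options = [(m[state][j], a, j)
                   for a, m in per_symbol.items()
                   for j in range(len(m)) if m[state][j] > 0.0]
        r, acc = rng.random(), 0.0
        for prob, allele, nxt in options:
            acc += prob
            if r < acc:
                break
        out.append(allele)
        state = nxt
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate(T, 0, 8, rng))  # start in state A
    print(generate(T, 1, 8, rng))  # start in state B
```

Because every transition has probability one, the random number never matters here; the same helper samples branching machines such as the Golden Mean Process when given their matrices.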



We are now ready to describe sequential learning, depicted 
in Fig. 5. We begin by selecting an initial population 
generator M_0, an ε-machine. Following a path 
through M_0, guided by its transition probabilities, produces 
a length-N string α^N_0 = a_1 ... a_N that represents 
the first population of N individuals possessing alleles 
a_i ∈ A. We then infer an ε-machine M_1 from the population 
α^N_0. M_1 is then used to produce a new population 
α^N_1, from which a new ε-machine M_2 is estimated. 
This new population has the same allele distribution 
as the previous, plus some amount of variance. 
The cycle of inference and re-inference is repeated 
while allele frequencies drift each generation until fixation 
or deletion is reached. At that point, the populations 
(and so ε-machines) cannot vary further. The net 
result is a stochastically varying time series of ε-machines 
(M_0, M_1, M_2, ...) that terminates when the populations 
α^N_t stop changing. 
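
The generate-and-infer cycle can be illustrated with a deliberately simplified stand-in. The sketch below does not perform ε-machine inference; it assumes a binary alphabet and replaces the model class with a first-order Markov chain, which is enough to represent the Golden Mean Process (previous symbol 1 plays the role of state A, previous symbol 0 the role of state B). All names are ours.

```python
import random


def generate(model, n, rng, start=1):
    """Emit n binary symbols; model[s] = Pr(next symbol is 0 | previous symbol s)."""
    prev, out = start, []
    for _ in range(n):
        sym = 0 if rng.random() < model[prev] else 1
        out.append(sym)
        prev = sym
    return out


def infer(sample, start=1):
    """Maximum likelihood estimate of Pr(0 | previous symbol) from one ordered sample."""
    counts = {0: [0, 0], 1: [0, 0]}
    prev = start
    for sym in sample:
        counts[prev][sym] += 1
        prev = sym
    # A context that never occurred is unreachable; its value is set to 0.0 by convention.
    return {s: (c[0] / sum(c) if sum(c) else 0.0) for s, c in counts.items()}


def is_static(model, start=1):
    """True when every context reachable from `start` has a deterministic next symbol."""
    s, seen = start, set()
    while s not in seen:
        if 0.0 < model[s] < 1.0:
            return False
        seen.add(s)
        s = 0 if model[s] == 1.0 else 1
    return True


def drift(model, n, rng, max_generations=100_000):
    """Alternate generation and inference until the model stops varying."""
    history = [model]
    while not is_static(model) and len(history) <= max_generations:
        model = infer(generate(model, n, rng))
        history.append(model)
    return history


if __name__ == "__main__":
    golden_mean = {0: 0.0, 1: 0.5}  # never two 0s in a row
    history = drift(golden_mean, n=100, rng=random.Random(3))
    print(len(history) - 1, history[-1])
```

In this restricted model class the chain ends either with model[1] = 0.0 (all 1s) or model[1] = 1.0 (strict alternation). Because the class is fixed at order one, the sketch cannot add or remove states the way ε-machine inference can.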

Thus, at each step a new representation or model is 
estimated from the previous step's sample. The infer- 
ence step highlights that this is learning: a model of the 
generator is estimated from the given finite data. The 



VI. STRUCTURAL STASIS 

Recall the Alternating Process from Fig. 3, producing 
the strings 0101... and 1010... depending on the 
start state. Regardless of the initial state, the original 
ε-machine is re-inferred from any sufficiently long string 
it produces. In the context of sequential learning, this 
means the population at each generation is the same. 

However, if we consider allele A_1 to be represented 
by symbol 0 and A_2 by symbol 1, neither allele reaches 
fixation or deletion according to current definitions. 
Nonetheless, the Alternating Process prevents any variance 
between generations and so, despite the population 
not being all 0s or all 1s, the population does reach an 
equilibrium: half 0s and half 1s. For these reasons, one 
cannot use the original population-dynamics definitions 
of fixation and deletion. 



[Figure 5 diagram: M_0 generates alleles α_0 = a_0 a_1 ... a_N; an ε-machine M_1 is inferred from α_0; M_1 generates alleles α_1 = a_0 a_1 ... a_N; an ε-machine M_2 is inferred from α_1.] 

FIG. 5. Sequential inference of a chain of ε-machines. An 
initial population generator M_0 produces a length-N string 
α^N_0 = a_1 ... a_N from which a new model M_1 is inferred. These 
steps are repeated using M_1 as the population generator and 
so on, until a terminating condition is met. 



This leads us to introduce structural stasis to com- 
bine the notions of fixation, deletion, and the inability to 
vary caused by periodicity. Said more directly, structural 
stasis corresponds to a process becoming nonstochastic, 
since it ceases to introduce variance between generations 
and so prevents further drift. However, we need a method 
to detect the occurrence of structural stasis in a drift pro- 
cess. 

A state machine representing a periodic sampling process 
enforces the constraint of periodicity via its internal 
memory. One measure of this memory is the population 
diversity H(N) [20]: 

    H(N) = H[A_1 \ldots A_N]    (10) 

         = - \sum_{\alpha^N \in A^N} \Pr(\alpha^N) \log_2 \Pr(\alpha^N) ,    (11) 

where the units are [bits].^2 The population diversity of 
the Alternating Process is H(N) = 1 bit at any size 
N ≥ 1. This single bit of information corresponds to 
the machine's current phase or state. Generally, though, 
the value diverges, H(N) ∝ N, for arbitrary sampling 
processes, and so population diversity is not suitable as 
a general test for stasis. 
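
The contrast between a bounded and a linearly growing H(N) can be checked by brute force on small machines. The sketch below enumerates all length-N strings, so it is only practical for small N; the machine encoding (state mapped to a list of (probability, allele, next state) triples) is an assumption of ours.

```python
from math import log2


def population_diversity(machine, start_probs, n):
    """H(N) in bits for length-n strings emitted by a unifilar machine."""
    # paths maps (string so far, current state) -> probability
    paths = {("", s): p for s, p in start_probs.items() if p > 0.0}
    for _ in range(n):
        extended = {}
        for (word, state), p in paths.items():
            for prob, allele, nxt in machine[state]:
                if prob > 0.0:
                    key = (word + allele, nxt)
                    extended[key] = extended.get(key, 0.0) + p * prob
        paths = extended
    # Marginalize over the final state to get the distribution over strings.
    words = {}
    for (word, _state), p in paths.items():
        words[word] = words.get(word, 0.0) + p
    return -sum(p * log2(p) for p in words.values())


alternating = {"A": [(1.0, "0", "B")], "B": [(1.0, "1", "A")]}
fair_coin = {"A": [(0.5, "H", "A"), (0.5, "T", "A")]}

if __name__ == "__main__":
    for n in range(1, 6):
        print(n,
              population_diversity(alternating, {"A": 0.5, "B": 0.5}, n),
              population_diversity(fair_coin, {"A": 1.0}, n))
```

The Alternating Process stays at one bit for every N, while the Fair Coin grows by one bit per allele; the successive differences H(N) - H(N-1) are what distinguish the two.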

Instead, the condition for stasis can be given as the 
vanishing of the growth rate of population diversity: 

    h_\mu = \lim_{N \to \infty} \left[ H(N) - H(N-1) \right] .    (12) 



Definition. Structural stasis occurs when the sampling 
process's allelic entropy vanishes: h_μ = 0. 

Proposition 1. Structural stasis is a fixed point of 
finite-memory structural drift. 

Proof. Finite-memory means that the ε-machine representing 
the population sampling process has a finite number 
of states. Given this, if h_μ = 0, then the ε-machine 
has no branching in its recurrent states: T^{(a)}_{ij} = 0 or 1, 
where S_i and S_j are asymptotically recurrent states. This 
results in no variation in the inferred ε-machine when 
sampling sufficiently large populations. Lack of variation, 
in turn, means that Δp = 0 and so the drift process 
stops. If allelic entropy vanishes at time t and no mutations 
are allowed, then it is zero for all t' > t. Thus, 
structural stasis is an absorbing state of the drift stochastic 
process. 



VII. EXAMPLES 

While more can be said analytically about structural 
drift, our present purpose is to introduce the main con- 
cepts. We will show that structural drift leads to inter- 
esting and nontrivial behavior. First, we calibrate the 
new class of drift processes against the original genetic 
drift theory. 



Equivalently, we can test the per-allele entropy of the 
sampling process. We call this allelic entropy: 

    h_\mu = \lim_{N \to \infty} \frac{H(N)}{N} ,    (13) 

where the units are [bits per allele]. Allelic entropy gives 
the average information per allele in bits, and structural 
stasis occurs when h_μ = 0. While closer to a general test 
for stasis, this quantity is difficult to estimate from population 
samples since it relies on an asymptotic estimate 
of the population diversity. However, the allelic entropy 
can be calculated in closed-form from the ε-machine representation 
of the sampling process: 

    h_\mu = - \sum_{\sigma \in S} \Pr(\sigma) \sum_{a \in A} \sum_{\sigma' \in S} T^{(a)}_{\sigma\sigma'} \log_2 T^{(a)}_{\sigma\sigma'} .    (14) 

When h_μ = 0, the sampling process has become periodic 
and lost all randomness generated via its branching 
transitions. This new criterion subsumes the notions of 
fixation and deletion as well as periodicity. An ε-machine 
has zero allelic entropy if any of these conditions occur. 
More formally, we have the following statement. 
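
For a unifilar machine the closed form is the state-averaged entropy of the outgoing transition probabilities. The sketch below computes it under that standard reading; the machine encoding and the hand-supplied stationary state probabilities are illustrative assumptions of ours.

```python
from math import log2


def allelic_entropy(machine, state_probs):
    """h_mu in bits per allele: state-weighted entropy of outgoing transitions."""
    h = 0.0
    for state, transitions in machine.items():
        for prob, _allele, _next_state in transitions:
            if prob > 0.0:  # zero-probability transitions contribute nothing
                h -= state_probs[state] * prob * log2(prob)
    return h


# Each state maps to a list of (probability, allele, next state) triples.
alternating = {"A": [(1.0, "0", "B")], "B": [(1.0, "1", "A")]}
fair_coin = {"A": [(0.5, "H", "A"), (0.5, "T", "A")]}
golden_mean = {"A": [(0.5, "0", "B"), (0.5, "1", "A")], "B": [(1.0, "1", "A")]}

if __name__ == "__main__":
    print(allelic_entropy(alternating, {"A": 0.5, "B": 0.5}))   # periodic: stasis
    print(allelic_entropy(fair_coin, {"A": 1.0}))               # maximally random
    print(allelic_entropy(golden_mean, {"A": 2 / 3, "B": 1 / 3}))
```

Only state A of the Golden Mean Process branches, so its allelic entropy is the one-bit branching entropy weighted by Pr(A) = 2/3.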



^2 For background on information theory as used here, the reader 
is referred to Ref. [21]. 



A. Memoryless Drift 

The Biased Coin Process is represented by a single-state 
ε-machine with a self loop for both Heads and 
Tails symbols. It is an IID sampling process that generates 
populations with a binomial distribution. Unlike the 
Alternating Process, the coin's bias p is free to drift during 
sequential inference. These properties make the Biased 
Coin Process an ideal candidate for exploring memoryless 
drift. 

Fig. 6 shows structural drift, using two different measures, 
for a single realization of the Biased Coin Process 
with initial p = Pr[HEADS] = Pr[TAILS] = 0.5. Structural 
stasis (h_μ = 0) is reached after 115 generations. 
The initial Fair Coin ε-machine occurs at the left of Fig. 
6 and the final, completely biased ε-machine occurs at 
the right. 

Note that the drift of allelic entropy h_μ and 
p = Pr[TAILS] are inversely related, with allelic entropy 
converging quickly to zero as stasis is approached. This 
reflects the rapid drop in population diversity. After stasis 
occurs, all randomness has been eliminated from the 
transitions at state A, resulting in a single transition that 
always produces Tails. Anticipating later discussion, we 
note that during this run only Biased Coin Processes were 
observed. 

The time to stasis of the Biased Coin Process as a function 
of initial p = Pr[HEADS] was shown in Fig. 2. Also 
shown there was the previous Monte Carlo Kimura drift 
simulation modified to terminate when either fixation or 
deletion occurs. This experiment illustrates the definition 
of structural stasis and allows direct comparison of 
structural drift with genetic drift in the memoryless case. 

FIG. 6. Drift of allelic entropy h_μ and Pr[HEADS] for a 
single realization of the Biased Coin Process with sample 
size N = 100. The drift of Pr[HEADS] is annotated with its 
initial machine M_0 (left inset) and the machine at stasis M_115 
(right inset). 

FIG. 7. Drift of Pr[HEADS] for a single realization of the 
Biased Coin, Golden Mean, and Even Processes, plotted as a 
function of generation. The Even and Biased Coin Processes 
become the Fixed Coin Process at stasis, while the Golden 
Mean Process becomes the Alternating Process. Note that the 
definition of structural stasis recognizes the lack of variance 
in the Alternating Process subspace even though the allele 
probability is neither 0 nor 1. 

Not surprisingly, we can interpret genetic drift as a 
special case of the structural drift process for the Biased 
Coin. Both simulations follow Kimura's theoretically 
predicted curves, combining the lower half of the 
deletion curve with the upper half of the fixation curve 
to reflect the initial probability's proximity to the absorbing 
states. A high or low initial bias leads to a shorter 
time to stasis as the absorbing states are closer to the 
initial state. Similarly, a Fair Coin is the furthest from 
absorption and thus takes the longest average time to 
reach stasis. 



B. Structural Drift 

The Biased Coin Process represents an IID sampling 
process with no memory of previous flips, reaching stasis 
when Pr[HEADS] = 1.0 or 0.0 and, correspondingly, 
when h_μ(M_t) = 0.0. We now introduce memory by starting 
drift with M_0 as the Golden Mean Process, which 
produces binary populations with no consecutive 0s. Its 
ε-machine was shown in Fig. 4. Note that one can initialize 
drift using any stochastic process; for example, see 
the ε-machine library of Ref. [22]. 

Like the Alternating Process, the Golden Mean Process 
has two causal states. However, the transitions from 
state A have nonzero entropy, allowing their probabilities 
to drift as new ε-machines are inferred from generation 
to generation. If the A → B transition probability p 
(Fig. 4) becomes zero the transition is removed, and the 
Golden Mean Process reaches stasis by transforming into 
the Fixed Coin Process (top right, Fig. 6). Instead, if 
the same transition drifts towards probability p = 1, the 
A → A transition is removed. In this case, the Golden 
Mean Process reaches stasis by transforming into the Alternating 
Process (Fig. 3). 

To compare structural drift behaviors, consider also 
the Even Process. Similar in form to the Golden Mean 
Process, the Even Process produces populations in which 
blocks of consecutive 1s must be even in length when 
bounded by 0s [21]. Figure 7 compares the drift of 
Pr[HEADS] for a single realization of the Biased Coin, 
Golden Mean, and Even Processes. One observes that 
the Even and Biased Coin Processes reach stasis as the 
Fixed Coin Process, while the Golden Mean Process 
reaches stasis as the Alternating Process. For different 
realizations, the Even and Golden Mean Processes might 
instead reach different stasis points. 

It should be noted that the memoryful Golden Mean 
and Even Processes reach stasis markedly faster than the 
memoryless Biased Coin. While Fig. 7 shows only a single 
realization of each sampling process type, the top 
panel of Fig. 9 shows that the large disparity in stasis times 
holds across all settings of each process's initial bias. This 
is one of our first general observations about memoryful 
processes: The structure of memoryful processes substantially 
impacts the average time to stasis by increasing 
variance between generations. In the cases shown, time 
to stasis is greatly shortened. 



VIII. ISOSTRUCTURAL SUBSPACES 



A. Subspace Diffusion 

To illustrate the richness of structural drift and to
understand how it affects average time to stasis, we
examine the complexity-entropy (CE) diagram [23] of the
e-machines produced over several realizations of an
arbitrary sampling process. The CE diagram displays how
the allelic entropy $h_\mu$ of an e-machine varies with the
allelic complexity $C_\mu$ of its causal states:



$$ C_\mu = -\sum_{\sigma \in \mathcal{S}} \Pr(\sigma) \log_2 \Pr(\sigma) , \qquad (15) $$



where the units are [bits]. The allelic complexity is the
Shannon entropy over an e-machine's stationary state
distribution $\Pr(\mathcal{S})$. It measures the memory
needed to maintain the internal state while producing
stochastic outputs. e-Machine minimality guarantees that
$C_\mu$ is the smallest amount of memory required to do
so. Since there is a one-to-one correspondence between
processes and their e-machines, a CE diagram is a
projection of process space onto the two coordinates
$(h_\mu, C_\mu)$. Used in tandem, these two properties
differentiate many types of sampling process, capturing
both their intrinsic memory ($C_\mu$) and the diversity
($h_\mu$) of populations they generate.
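
As a worked illustration of these two coordinates, the sketch
below evaluates them in closed form for the Golden Mean Process.
It assumes the parameterization used earlier (A → B with
probability p, B → A with probability 1), for which the
stationary distribution is Pr(A) = 1/(1 + p) and
Pr(B) = p/(1 + p). The function names are illustrative.

```python
from math import log2

def binary_entropy(q):
    """Shannon entropy in bits of a coin with bias q."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1.0 - q) * log2(1.0 - q)

def golden_mean_ce(p):
    """Return (h_mu, C_mu) in bits for the Golden Mean Process."""
    pi_a = 1.0 / (1.0 + p)             # stationary probability of state A
    c_mu = binary_entropy(pi_a)        # entropy over the two causal states
    h_mu = pi_a * binary_entropy(p)    # only state A has a random choice
    return h_mu, c_mu

if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, golden_mean_ce(p))
```

Under these assumptions the endpoints are (0, 0) at p = 0 and
(0, 1) at p = 1, with the values in between tracing a curve in
the CE plane.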

Two such CE diagrams are shown in Fig. 8, illustrating
different subspaces and stasis points reachable by the
Golden Mean Process during structural drift. Consider
the left panel first. An e-machine reaches stasis by
transforming into either the Fixed Coin or the
Alternating Process. To reach the former, the e-machine
begins on the upper curve in the left panel and drifts
until the A → B transition probability nears zero and the
inference algorithm decides to merge states in the next
generation. This forces the e-machine to jump to the
Biased Coin subspace on the line $C_\mu = 0$, where it
will most likely diffuse until the Fixed Coin stasis point
at $(h_\mu, C_\mu) = (0, 0)$ is reached. If instead the
A → B transition probability drifts towards one, the
Golden Mean stays on the upper curve until reaching the
Alternating Process stasis point at
$(h_\mu, C_\mu) = (0, 1)$. Thus, the two stasis points are
differentiated not by $h_\mu$ but by $C_\mu$, with the
Alternating Process requiring 1 bit of memory to track
its internal state and the Biased Coin Process requiring
none.

What emerges from these diagrams is a broader view
of how population structure drifts in process space.
Roughly, the $M_t$ diffuse locally in the parameter space
specified by the current, fixed architecture of states and
transitions. During this, transition probability estimates
vary stochastically due to sampling variance. Since
$C_\mu$ and $h_\mu$ are continuous functions of the
transition probabilities, this variance causes the $M_t$
to fall on well defined curves or regions corresponding to
a particular process subspace. (See Figs. 4 and 5 in
Ref. [23] and the theory for these curves and regions
there.)

We refer to these curves as isostructural curves and
the associated sets of e-machines as isostructural
subspaces. They are metastable subspaces of sampling
processes that are quasi-invariant under the structural 
drift dynamic. When one or more e-machine param- 
eters diffuse sufficiently so that inference is forced to 
change topology by adding or removing states or tran- 
sitions to reflect the statistics of the sample, this quasi- 
invariance is broken. We call such topological shifts sub- 
space jumps to reflect the new region occupied by the 
resulting e-machine in process space, as visualized by the 
CE diagram. Movement between subspaces is often not 
bidirectional — innovations from a previous topology may 
be lost either temporarily (when the innovation can be re- 
stored by returning to the subspace) or permanently. For 
example, the Golden Mean subspace commonly jumps to 
the Biased Coin subspace but the opposite is highly im- 
probable without mutation. (We consider the latter type 
of structured drift elsewhere.) 

Before describing the diversity seen in the CE diagram 
of Fig. 8's right panel, we first turn to analyze in some de- 
tail the time-to-stasis underlying the behavior illustrated 
in the left panel. 



B. Subspace Decomposition 

A pathway is a set of subspaces passed through by any
drift realization starting from some initial process and
reaching a specific stasis point. The time to stasis of
a drift process $\mathcal{P}$ is the sum of time spent in
the subspaces $\gamma$ visited by its pathways to stasis
$\rho$, weighted by the probabilities that these pathways
and subspaces will be reached. The time spent in a
subspace $\gamma_{i+1}$ depends only on the transition
parameter(s) of the e-machine at the time of entry and is
otherwise independent of the prior subspace $\gamma_i$.
Thus, calculating the stasis time of a structured
population can be broken down into independent subspace
times when we know the values of the transition
parameters at subspace jumps. These values can be
derived both empirically and analytically, and we aim to
develop the latter for general drift processes in future
work.

More formally, the time-to-stasis $t_s$ of a drift process
$\mathcal{P}$ is simply the weighted sum of the stasis
times for its connected pathways $\rho$:



$$ t_s(\mathcal{P}) = \sum_{i=1}^{|\rho|} \Pr(\rho_i | \mathcal{P}) \, t_s(\rho_i | \mathcal{P}) . \qquad (16) $$



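Similarly, the stasis time of a particular pathway
decomposes into the time spent diffusing in its connected
subspaces $\gamma$:

$$ t_s(\rho_i | \mathcal{P}) = \sum_{j=1}^{|\gamma|} \Pr(\gamma_j | \rho_i, \mathcal{P}) \, t(\gamma_j | \rho_i, \mathcal{P}) . \qquad (17) $$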



To demonstrate, Fig. 9 shows the stasis time of the
Golden Mean Process (GMP) with initial bias $p_0$ in more
detail. Regression lines along with their 95% confidence
intervals are displayed for simulations with initial biases
$p_0 \in \{0.1, 0.2, \ldots, 0.9\}$. The middle panel shows
the total time-to-stasis as the weighted sum of its Fixed
Coin (FC) and Alternating Process (AP) pathways:

$$ t_s(\mathrm{GMP}(p_0)) = \Pr(\mathrm{FC} | \mathrm{GMP}(p_0)) \, t_s(\mathrm{FC} | \mathrm{GMP}(p_0)) + \Pr(\mathrm{AP} | \mathrm{GMP}(p_0)) \, t_s(\mathrm{AP} | \mathrm{GMP}(p_0)) . $$

For low $p_0$, the transition from state A to state B is
unlikely, so 0s are rare and the AP pathway is reached
infrequently. Thus, the total stasis time is initially
dominated by the FC pathway
($\Pr(\mathrm{FC} | \mathrm{GMP}(p_0))$ is high). As
$p_0 \to 0.3$ and above, the AP pathway is reached more
frequently ($\Pr(\mathrm{AP} | \mathrm{GMP}(p_0))$ grows)
and its stasis time begins to influence the total. The FC
pathway is less likely as $p_0 \to 0.6$ and the total time
becomes dominated by the AP pathway
($\Pr(\mathrm{AP} | \mathrm{GMP}(p_0))$ is high).

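Since the AP pathway visits only one subspace, the
bottom panel shows the stasis time of the FC pathway as
the weighted sum of the Golden Mean (GM) and Biased
Coin (BC) subspace times:

$$ t_s(\mathrm{FC} | \mathrm{GMP}(p_0)) = \Pr(\mathrm{GM} | \mathrm{FC}, \mathrm{GMP}(p_0)) \, t(\mathrm{GM} | \mathrm{FC}, \mathrm{GMP}(p_0)) + \Pr(\mathrm{BC} | \mathrm{FC}, \mathrm{GMP}(p_0)) \, t(\mathrm{BC} | \mathrm{FC}, \mathrm{GMP}(p_0)) . \qquad (18) $$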

This corresponds to time spent diffusing in the GM sub- 
space before the subspace jump and time spent diffusing 
in the BC subspace after the subspace jump. Note that 
the times quoted are simply diffusion times within a sub- 
space, since not every subspace in a pathway contains a 
stasis point. 
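
The bookkeeping in this decomposition can be sketched in a few
lines. All probabilities and times below are made-up
placeholders chosen only to show the arithmetic; they are not
read from Fig. 9, and the helper name is hypothetical.

```python
def weighted_time(terms):
    """Sum probability-weighted times.

    terms: iterable of (probability, time) pairs.
    """
    return sum(prob * time for prob, time in terms)

# Hypothetical values for one initial bias p0 (not from Fig. 9).
# FC pathway: time in the GM subspace plus time in the BC subspace.
t_fc = weighted_time([(1.0, 40.0),    # Pr(GM|FC), t(GM|FC)
                      (1.0, 10.0)])   # Pr(BC|FC), t(BC|FC)

# AP pathway: visits only the GM subspace.
t_ap = weighted_time([(1.0, 30.0)])

# Total: pathway times weighted by pathway probabilities.
t_total = weighted_time([(0.7, t_fc),   # Pr(FC), t_s(FC)
                         (0.3, t_ap)])  # Pr(AP), t_s(AP)

if __name__ == "__main__":
    print(t_fc, t_ap, t_total)
```

In practice the inputs would come from simulation estimates of
the pathway probabilities and subspace diffusion times.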

These expressions emphasize the dependence of stasis
time on the transition parameters at jump points as well
as on the architecture of isostructural subspaces in drift
process space. For example, if the GM jumps to the BC
subspace at p ≈ 0.5, the stasis time will be large since
the e-machine is maximally far from either stasis point.
However, the inference algorithm will typically jump at
very low values of p, resulting in a small average stasis
time for the BC subspace in the FC pathway. Due to
this, calculating the stasis time for the GMP requires
knowing the AP and FC pathways as well as the value of
p where the GM → BC jump occurs.



C. Structural Innovation and Loss 

Inference of e-machines from finite populations is com- 
putationally expensive, particularly in a sequential set- 
ting with many realizations. The topology of the 
e-machine is inferred directly from the statistics of finite 
samples; both states and transitions are added and re- 
moved over time to capture innovation and loss of popula- 
tion structure. In the spirit of Kimura's pseudo-sampling 
variable method [24], we introduce a pseudo-drift algo- 
rithm for efficient drift simulation and increased control 
of the trade-off between structural innovation and loss. 



Instead of inferring and re-inferring an e-machine each
generation, we explicitly define the conditions for
topological changes to the e-machine of the previous
generation. To test for structural innovation, a random
causal state from the current $M_t$ is cloned and random
incoming transitions are routed instead to the cloned
state. This creates a new model $M'_t$ that describes the
same process. Gaussian noise is then added to the cloned
state's outgoing transitions to represent change in
population structure. The likelihood of the population
$\alpha_t$ is calculated for both $M_t$ and $M'_t$ and the
model with the maximum a posteriori (MAP) likelihood is
retained:

$$ M_{\mathrm{MAP}} = \operatorname{argmax} \{ \Pr(\alpha_t | M_t), \Pr(\alpha_t | M'_t) \} . \qquad (19) $$

If the original $M_t$ was retained, its transition
parameters are updated by feeding the sample through the
model to obtain edge counts, which are normalized to
obtain probabilities. This produces a generator for the
next generation's population in a way that allows for
innovation. It also side-steps the computational cost of
the inference algorithm.
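
The parameter update step (feed the sample through the model,
count edges, normalize) can be sketched as follows. The
dictionary encoding of the machine, the function name, and the
choice of start state are assumptions made for illustration;
the sketch also assumes each (state, symbol) pair leads to a
single next state.

```python
from collections import Counter, defaultdict

def reestimate(transitions, start, sample):
    """Estimate edge probabilities by running a sample through a machine.

    transitions: {(state, symbol): next_state}
    Returns {(state, symbol): probability} for the edges that were used.
    """
    counts = Counter()
    state = start
    for symbol in sample:
        edge = (state, symbol)
        if edge not in transitions:
            raise ValueError(f"sample not generated by this topology: {edge}")
        counts[edge] += 1
        state = transitions[edge]
    totals = defaultdict(int)            # visits per source state
    for (source, _), c in counts.items():
        totals[source] += c
    return {edge: c / totals[edge[0]] for edge, c in counts.items()}

# Golden Mean topology: A -1-> A, A -0-> B, B -1-> A.
GOLDEN_MEAN = {("A", 1): "A", ("A", 0): "B", ("B", 1): "A"}

if __name__ == "__main__":
    print(reestimate(GOLDEN_MEAN, "A", [1, 0, 1, 1, 0, 1]))
```

An edge that is never traversed simply does not appear in the
result, which corresponds to an estimated probability of zero.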

To capture structural loss, we monitor near-zero tran- 
sition probabilities where the e-machine inference algo- 
rithm would merge states. When such a transition ex- 
ists we test for structural simplification by considering all 
pairwise mergings of causal states and select the topology 
via the MAP likelihood. However, unlike above, we pe- 
nalize likelihood using the Akaike Information Criterion 
(AIC) [25]: 



$$ \mathrm{AIC} = 2k - 2 \ln(L) , \qquad (20) $$

and, in particular, the AIC corrected for finite sample
sizes [26]:

$$ \mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1} , \qquad (21) $$



where k is the number of model parameters, L is the 
sample likelihood, and n is the sample size. A penal- 
ized likelihood is necessary because a smaller e-machine 
is more general and cannot fit the data as well. When 
penalized by model size, however, a smaller model with 
sufficient fit to the data may be selected over a larger, 
better fitting model. This method allows loss to occur 
while again avoiding the expense of the full e-machine 
inference algorithm. Extensive comparisons with several 
versions of the latter show that the new pseudo-drift al- 
gorithm produces qualitatively the same behavior. 
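
A short sketch of the penalized comparison follows. The
candidate models, their parameter counts, and their
log-likelihoods are hypothetical numbers chosen to show a case
in which the smaller model wins; the function names are
illustrative.

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion from k and ln(L)."""
    return 2 * k - 2 * log_likelihood

def aicc(k, log_likelihood, n):
    """AIC with the finite-sample correction; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    return aic(k, log_likelihood) + 2 * k * (k + 1) / (n - k - 1)

def select_model(candidates, n):
    """Return the name of the candidate with the smallest AICc.

    candidates: {name: (k, log_likelihood)}
    """
    return min(candidates, key=lambda name: aicc(*candidates[name], n))

if __name__ == "__main__":
    # Hypothetical fits to one sample of n = 20 symbols.
    candidates = {"two-state": (2, -11.0),   # fits slightly better
                  "merged": (1, -11.5)}      # one fewer parameter
    print(select_model(candidates, n=20))
```

Here the merged model is preferred because its small loss in
fit is outweighed by the penalty on the extra parameter.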

Having explained how the pseudo-drift algorithm intro- 
duces structural innovation and loss we can now describe 
the drift runs of Fig. 8's right panel. In contrast to the 
left panel, structural innovation was enabled. The imme- 
diate result is that the drift process visits a much wider 
diversity of isostructural subspaces — sampling processes 
that are markedly more complex. e-Machines with 8 or
more states are created, some of which are quite entropic
and so produce high sampling variance. Stasis e-machines
with periods 3, 4, 5, and 6 are seen, while only those with
periods 1 and 2 are seen in runs without innovation (left
panel).

By way of closing this first discussion of structural 
drift, it should be emphasized that none of the preceding 
phenomena occur in the limit of infinite populations or 
infinite sample size. The variance due to finite sampling 
drives sequential learning, the diffusion through process 
space, and the jumps between isostructural subspaces. 



IX. APPLICATIONS AND EXTENSIONS 

Much of the previous discussion focused on structural
drift as a kind of stochastic process, with examples and
behaviors selected to emphasize the role of structure.
Although there was a certain terminological bias toward
neutral evolution theory, since the latter provides an
entrée to analyzing how structural drift works, our
presentation was intentionally general. We now describe
a variety of potential applications and extensions and
close with several summary remarks on structural drift
itself.



A. Emergent Semantics and Learning in
Communication Chains



Let's return to draw parallels with the opening ex- 
ample of the game of Telephone or, more directly, to 
the sequential inference of temporal structure in an ut- 
terance passed along a serially coupled communication 
chain. There appears to be no shortage of related theo- 
ries of language evolution. These range from the popula- 
tion dynamics of Ref. [27] and the ecological dynamics of 
Ref. [28] to the cataloging of error sources in human com- 
munication [29] and recent efforts to understand cultural 
evolution as reflecting learning biases [30, 31]. 

By way of contrast, structural drift captures the
language-centric notion of dynamically changing
semantics and demonstrates how behavior is driven by
finite-sample fluctuations within a semantically organized
subspace. The symbols and words in the generated strings
have a semantics given by the structure of a subspace's
e-machine; see Ref. [3]. A particularly simple example
was identified quite early in the information-theoretic
analysis of natural language: The Golden Mean e-machine
(Fig. 4) describes the role of isolated space symbols in
written English [32, Fig. 1]. Notably, this structure is
responsible for the Mandelbrot-Zipf power-law scaling of
word frequencies [33, 34]. More generally, though, the
semantic theory of e-machines shows that causal states
provide dynamic contexts for interpretation as individual
symbols and words are recognized. Quantitatively, the
allelic complexity $C_\mu(M_t)$ is the total amount of
semantic content that can be generated by an $M_t$ [3].
In this way, shifts in the architecture of the $M_t$ during
drift correspond to semantic changes. That is, diffusion
within an isostructural subspace corresponds to constant
semantics, while jumps between isostructural subspaces
correspond to semantic innovations (or losses).
In the drift behaviors explored above, the $M_t$ went
to stasis ($h_\mu = 0$) corresponding to periodic formal
languages. Clearly, such a long-term condition falls far
short as a model of human communication chains. The
resulting communications, though distant from those at
the beginning of the chain, are not periodic. To more
closely capture emergent semantics in the context of
sequential language learning, we have extended
structural drift to include mutation and selection. In
future work we will use these extensions to investigate
how the former prevents permanent stasis and the latter
enables a preference for intelligible phrases.



B. Cultural Evolution and Iterated Learning 

Extending these observations, the Iterated Learning 
Model (ILM) of language evolution [35, 36] is of partic- 
ular interest. In this model, a language evolves by re- 
peated production and acquisition by agents under cul- 
tural pressures and the "poverty of the stimulus" [35]. 
Via this process, language is effectively forced through a 
transmission bottleneck that requires the learning agent 
to generalize from finite data. This, in turn, exerts pres- 
sure on the language to adapt to the bias of the learner. 
Thus, in contrast to traditional views that the human 
brain evolved to learn language, ILM suggests that lan- 
guage also adapts to be learnable by the human brain. 

ILM incorporates the sequential learning and propaga- 
tion of error we discuss here and provides valuable insight 
into the effects of error and cultural mutations on the evo- 
lution of language for the "human niche" . There are var- 
ious simulation approaches to ILM with both single and 
multiple agents based on, for example, neural networks 
and Bayesian inference, as well as experiments with hu- 
man subjects. We suggest that structural drift could also 
serve as the basis for single-agent ILM experiments, as 
found in Swarup et al. [37], where populations of alle- 
les in the former are replaced by linguistic features of 
the latter. The benefits are compelling: an information- 
theoretic framework for quantifying the trade-off between 
learner bias and transmission bottleneck pressures, visu- 
alization of cultural evolution via the CE diagram, and 
decomposition of the time-to-stasis of linguistic features 
in terms of isostructural subspaces as presented above. 



C. Epochal Evolution 

Beyond applications to knowledge transmission via se- 
rial communication channels, structural drift gives an al- 
ternative view of drift processes in population genetics. 
In light of new kinds of evolutionary behavior, it reframes 
the original questions about underlying mechanisms and 
extends their scope to phenomena that exhibit memory 
in the sampling process or that derive from structure in
populations. Examples of the latter include niche
construction [38], the effects of environmental toxins [39],
changes in predation [40], and socio-political factors [41] 
where memory lies in the spatial distribution of popula- 
tions. In addition to these, several applications to areas 
beyond population genetics proper suggest themselves. 

An intriguing parallel exists between structural drift 
and the longstanding question about the origins of punc- 
tuated equilibrium [42] when modeled as the dynamics of 
epochal evolution [43, 44]. The possibility of evolution's 
intermittent progress — long periods of stasis punctuated 
by rapid change — dates back to Fisher's demonstration 
of metastability in drift processes with multiple alleles 
[13]. 

Epochal evolution, though, presented an alternative to 
the views of metastability posed by Fisher's model and 
Wright's adaptive landscapes [45]. Within epochal evo- 
lutionary theory, equivalence classes of genotype fitness, 
called subbasins, are connected by fitness-changing por- 
tals to other subbasins. A genotype is free to diffuse 
within its subbasin via selectively neutral mutations, un- 
til an advantageous mutation drives genotypes through a 
portal to a higher-fitness subbasin. An increasing num- 
ber of genotypes derive from this founder and diffuse in 
the new subbasin until another portal to higher fitness 
is discovered. Thus, the structure of the subbasin-portal 
architecture dictates the punctuated dynamics of evolu- 
tion. 

Given an adaptive system which learns structure by 
sampling its past organization, structural drift theory 
implies that its evolutionary dynamics are inevitably 
described by punctuated equilibria. Diffusion in an 
isostructural subspace corresponds to a period of struc- 
tured equilibrium in a subbasin and subspace jumps cor- 
respond to rapid innovation or loss of organization during 
the transit of a portal. In this way, structural drift es- 
tablishes a connection between evolutionary innovation 
and structural change, identifying the conditions for cre- 
ation or loss of organization. Extending structural drift 
to include mutation and selection will provide a theoret- 
ical framework for epochal evolution using any number 
of structural constraints in a population. 



D. Evolution of Graph-Structured Populations 

We focused primarily on the drift of sequentially or- 
dered populations in which the generator (an e-machine) 
captured the structure and randomness in that ordering. 
We aimed to show that a population's organization plays 
a crucial role in its dynamics. This was, however, only 
one example of the general class of drift process we have 
in mind. For example, computational mechanics also de- 
scribes structure in spatially extended systems [46, 47]. 
Given this, it is straightforward to build a model of 
drift in geographically distributed populations that ex- 
hibit spatiotemporal structure. 

Though they have not tracked the structural complexity
embedded in populations as we have done here, a
number of investigations consider various classes of
structured populations. For example, the evolutionary
dynamics of structured populations have been studied
using undirected graphs to represent correlations between
individuals. Edge weights $w_{ij}$ between individuals i
and j give the probability that i will replace j with its
offspring when selected to reproduce.
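
The replacement rule just described can be sketched as a single
update step on a weighted graph. This is an illustrative neutral
version only: it assumes the reproducing individual is chosen
uniformly at random (no fitness differences) and uses a 4-node
cycle as a hypothetical example graph. The function names are
not from the cited work.

```python
import random

def replacement_step(types, weights, rng):
    """One update: pick i uniformly, then replace j with probability w_ij."""
    n = len(types)
    i = rng.randrange(n)
    j = rng.choices(range(n), weights=weights[i])[0]
    types[j] = types[i]                # offspring of i replaces j

def run_until_fixation(types, weights, rng, max_steps=1_000_000):
    """Repeat updates until one type remains; return (steps, winner)."""
    types = list(types)                # work on a copy
    for step in range(max_steps):
        if len(set(types)) == 1:
            return step, types[0]
        replacement_step(types, weights, rng)
    return max_steps, None

def cycle_weights(n):
    """Each node replaces its left or right neighbor with probability 1/2."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w[i][(i - 1) % n] = 0.5
        w[i][(i + 1) % n] = 0.5
    return w

if __name__ == "__main__":
    print(run_until_fixation([0, 1, 0, 1], cycle_weights(4), random.Random(5)))
```

Each such update is memoryless, in contrast to the e-machine
sampling used for structural drift.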

By studying fixation and selection behavior on different
types of graphs, Lieberman et al. found that graph
structures can sometimes amplify or suppress the effects
of selection, even guaranteeing the fixation of
advantageous mutations [48]. Jain and Krishna [49]
investigated the evolution of directed graphs and the
emergence of self-reinforcing autocatalytic networks of
interaction. They identified the attractors in these
networks and demonstrated a diverse range of behaviors
from the creation of structural complexity to its collapse
and permanent loss.

Graph evolution is a model of population structure 
complementary to that presented by structural drift. In 
the latter, e-machine structure evolves over time with 
nodes representing equivalence classes of the distribu- 
tion of selectively neutral alleles. Unlike e-machines, the 
multinomial sampling of individuals in graph evolution 
is a memoryless process. A combined approach will al- 
low one to examine how amplification and suppression of 
selection and the emergence of autocatalysis are affected 
by external influences on the population structure. For 
example, this could include how a population uses tem- 
poral memory to maintain desirable properties in antici- 
pation of structural shifts in the environment. The result 
would provide a theory for niche construction in which a 
nonlinear dynamics of pattern formation spontaneously 
changes population structure. 



X. FINAL REMARKS 

The Fisher-Wright model of genetic drift can be viewed
as a random walk of coin biases, a stochastic process that 
describes generational change in allele frequencies based 
on a strong statistical assumption: The sampling process 
is memoryless. Here, we developed a generalized struc- 
tural drift model that adds memory to the process and 
examined the consequences of such population sampling 
memory. 

Memoryful sampling is a substantial departure from 
modeling evolutionary processes with unordered popula- 
tions. Rather than view structural drift as a replacement 
for the well understood theory of genetic drift, and given 
that the latter is a special case of structurally drifting 
populations, we propose that it be seen as a new avenue 
for theoretical invention. Given its additional ties to lan- 
guage and cultural evolution, we believe it will provide a 
novel perspective on evolution in nonbiological domains, 
as well. 

The representation selected for the population sampling
mechanism was the class of probabilistic finite-state
hidden Markov models called e-machines. We discussed
how a sequential chain of e-machines inferred and
re-inferred from the finite data they generate parallels
the drift of alleles in a finite population, using otherwise
the same assumptions made by the Fisher-Wright
model. The mathematical foundations developed for the
latter and its related models provide a good deal of quan- 
titative, predictive power. Much of this has yet to be 
exploited. In concert with this, e-machine minimality 
allowed us to monitor information processing, informa- 
tion storage, and causal architecture during the drift pro- 
cess. We introduced the information-theoretic notion of 
structural stasis to combine the concepts of deletion, fixa- 
tion, and periodicity for drift processes. Generally, struc- 
tural stasis occurs when the population's allelic entropy 
vanishes — a quantity one can calculate in closed form due 
to the e-machine representation of the sampling process. 

We revisited Kimura and Ohta's early results measur- 
ing the time to fixation of drifting alleles and showed 
that the generalized structural drift process reproduces 
these well known results when staying within the mem- 
oryless sampling process subspace. Starting with struc- 
tured populations outside of that subspace led the sam- 
pling process to exhibit memory effects including struc- 
tural innovation and loss, complex transients, and greatly 
reduced stasis times. 

Simulations demonstrated how an e-machine diffuses
through isostructural process subspaces during
sequential learning. The result was a very complex
time-to-stasis dependence on the initial probability
parameter, much more complicated than Kimura's theory
describes.
Nonetheless, we showed that a process' time-to-stasis can 
be decomposed into sums over these independent sub- 
spaces. Moreover, the time spent in an isostructural 
subspace depends on the value of the e-machine prob- 
ability parameters at the time of entry. This suggests 
an extension to Kimura's theory for predicting the time 
to stasis for each isostructural component independently. 
Much of the phenomenological analysis was facilitated 
by the global view of drift process space given by the 
complexity-entropy diagram. 

Drift processes with memory generally describe the
evolution of structured populations without mutation or
selection. Nonetheless, we showed that structure leads
to substantially shorter stasis times. This was seen in
drifts starting with the Biased Coin and Golden Mean
Processes, where the Golden Mean jumps into the Biased
Coin subspace close to an absorbing state. This suggests
that even without selection, population structure and
sampling memory matter in evolutionary dynamics. The
temporal or spatial memory captured by the e-machine
can be interpreted as nonrandom mating, reducing the
effective population size $N_e$ and, in doing so,
increasing sampling variance. It also suggests that
memoryless models restrict sequential learning and
overestimate stasis times for structured populations.

We demonstrated how the components of structural drift
(diffusion, structural innovation, and loss) are controlled
by the architecture of connected isostructural subspaces. Many
questions remain about these subspaces. What is the de- 
gree of subspace-jump irreversibility? Can we predict the 
likelihood of these jumps? What does the phase portrait 
of a drift process look like? Thus, to better understand 
structural drift, we need to analyze the high-level orga- 
nization of generalized drift process space. 

Fortunately, e-machines are in one-to-one correspon- 
dence with structured processes [22]. Thus, the pre- 
ceding question reduces to understanding the space of 
e-machines and how they can be connected by diffusion 
processes. Is the diffusion within each process subspace
predicted by Kimura's theory or some simple variant?
We have given preliminary evidence that it is. And
so, there are reasons to be optimistic that, in the face of
the open-ended complexity of structural drift, a good deal
can be predicted analytically. This, in turn, will lead
to quantitative applications.